# About the Data

These datasets report votes counted for U.S. president on the night of the November 5, 2024 general election, and in the days and weeks after. The source of the data is the Edison Feed as reported by CNN. The data were assembled by downloading and reshaping data files that CNN hosts publicly as part of its web-based election reporting.

This release includes two datasets:

* `county_ENR_president.csv`: a county-level dataset containing every observation we took of vote counts for president. This dataset is *more granular* but *more raw*. These data directly reflect what CNN's files said at a given time. They may not exactly match the official vote counts of state and local governments at each timestamp.
* `state_ENR_president.csv`: a state-level dataset. This dataset is *less granular* but *more quality assured*. We ensured that the data do not include large drops in vote counts, which may be induced by the interaction between news agencies' file management practices and our data collection methods. This dataset includes every state, but not Washington, D.C.

Crucially, **these data are not official vote counts**. They are **a mirror of internal documents which support news reporting about official vote counts**. Any issues found in these datasets are likely not an issue with vote counting, but only an imperfection in the mirror.

## Variables and values
`county_ENR_president.csv` is a county-candidate-time-level dataset containing the following variables and types:

* `state` (`str`): The two-letter Postal Service abbreviation for each state.
* `cnn_code` (`str`): A numerical county code, provided by CNN. For most states, this is the county that administered the election. In the New England states (Connecticut, Massachusetts, Maine, New Hampshire, Rhode Island, and Vermont), it is an alphabetically sequential numerical identifier.
* `jurisdiction_fips` (`str`): The FIPS code of a county or administering jurisdiction. In every state outside New England, other than Alaska, the code appears to be a standard county FIPS code. In Alaska, the code appears to represent the state's 40 House districts. In New England, we imputed municipal-level FIPS codes by matching the sequential CNN codes to an alphabetical sequence of municipality names, and we used vote counts to identify any discrepancies in that match.
* `office` (`str`): The public office being contested. In this dataset, the value is always `president`.
* `candidate` (`str`): The last name of the candidate contesting the office, in camel case.
* `party` (`str`): The candidate's party, in camel case.
* `votes` (`int`): The total number of votes that had been counted for that candidate up to that time.
* `reported` (`int`): The percent of the vote reported in the jurisdiction at that time.
* `total` (`int`): The total number of votes counted in the jurisdiction up to that time.
* `share` (`float`): Each candidate's share of the vote in the jurisdiction at that time.
* `percent` (`int`): Each candidate's percent of the vote in the jurisdiction at that time. This is `share` multiplied by 100 and then rounded to the nearest integer.
* `datetime` (`datetime`): The date and time at which the vote count file was downloaded, in 24-hour `YYYY-MM-DD hh:mm:ss` format. Note that this matches the variable `time` in `state_ENR_president.csv`.
* `day` (`str`): The day on which the vote count file was downloaded, in `YYYY-MM-DD` format. This is extracted from `datetime` for ease of use.
* `time` (`time`): The time at which the vote count file was downloaded, in 24-hour `hh:mm` format. This is extracted from `datetime` for ease of use. Note that this does not match the variable `time` in `state_ENR_president.csv`, which holds the full date and time.
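
The relationships among the derived variables described above can be sketched in base R. This is a minimal illustration on made-up rows, not the code used to build the dataset. It assumes that `share` is stored as a fraction between 0 and 1, and that `day` and `time` are plain substrings of `datetime`.

```r
# Toy rows shaped like county_ENR_president.csv (all values are made up).
county <- data.frame(
  state     = c("MA", "MA"),
  candidate = c("Harris", "Trump"),
  votes     = c(600L, 400L),
  total     = c(1000L, 1000L),
  datetime  = c("2024-11-05 21:15:00", "2024-11-05 21:15:00"),
  stringsAsFactors = FALSE
)

# share: the candidate's fraction of the jurisdiction total (assumed 0-1 scale).
county$share <- county$votes / county$total

# percent: share on a 0-100 scale, rounded to a whole number.
county$percent <- as.integer(round(county$share * 100))

# day and time: the date and hh:mm pieces of datetime.
county$day  <- substr(county$datetime, 1, 10)
county$time <- substr(county$datetime, 12, 16)

print(county)
```

Running the same derivations on the real file and comparing against the stored columns is one way to confirm these assumptions before relying on them.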

`state_ENR_president.csv`, in addition to being less granular in space, also has fewer variables. This state-candidate-time-level dataset contains the following variables and types:

* `state` (`str`): The two-letter Postal Service abbreviation for each state.
* `candidate` (`str`): The last name of the candidate contesting the office, in camel case.
* `time` (`datetime`): The date and time at which the vote count file was downloaded, in 24-hour `YYYY-MM-DD hh:mm:ss` format. Note that this matches the variable `datetime` in `county_ENR_president.csv`.
* `votes` (`int`): The total number of votes that had been counted for that candidate up to that time.
* `voteDiff` (`int`): The change in that candidate's vote count in that state since the preceding timestamp: the number of votes counted at timestamp t<sub>1</sub> minus the number counted at the preceding timestamp t<sub>0</sub>. It is defined as 0 at the initial timestamp.
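
As an illustration of the `voteDiff` definition, the sketch below recomputes it from `votes` on a small made-up data frame. It is one possible base R implementation, not necessarily how the released column was produced, and the vote values are hypothetical.

```r
# Toy rows shaped like state_ENR_president.csv (all values are made up).
state <- data.frame(
  state     = "OH",
  candidate = rep(c("Harris", "Trump"), each = 3),
  time      = rep(c("2024-11-05 20:00:00",
                    "2024-11-05 21:00:00",
                    "2024-11-05 22:00:00"), times = 2),
  votes     = c(100L, 250L, 400L, 120L, 300L, 450L),
  stringsAsFactors = FALSE
)

# Put each state-candidate series in time order. Sorting the strings works
# because YYYY-MM-DD hh:mm:ss sorts chronologically as text.
state <- state[order(state$state, state$candidate, state$time), ]

# Within each series, subtract the preceding count; the first row gets 0.
state$voteDiff <- ave(
  state$votes, state$state, state$candidate,
  FUN = function(v) c(0L, diff(v))
)

print(state)
```

Summing `voteDiff` within a series and adding the first `votes` value should reproduce the last `votes` value, which gives a quick consistency check.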

## Quality Assurance Checks
There is a known issue in the county-level dataset: it reports too many votes for both Trump and Harris. The number of votes reported for Trump at the final timestamp in `county_ENR_president.csv` is 77,473,669, and the number reported for Harris at the final timestamp is 75,534,167; both totals are larger than the official numbers.

```
> county <- read.csv('county_ENR_president.csv')
> fin <- county[county$datetime == max(county$datetime),]
> sum(fin$votes[fin$candidate == 'Trump'])
[1] 77473669
> sum(fin$votes[fin$candidate == 'Harris'])
[1] 75534167
```

These totals are larger than the actual number of votes, not because of any actual issue with vote counting, but, we believe, for reasons induced by the file management structure of the news agency from which the data were collected. We have repaired these issues at the state level as part of aggregating `county_ENR_president.csv` to form `state_ENR_president.csv`:

```
> state <- read.csv('state_ENR_president.csv')
> fstate <- state[state$time == max(state$time),]
> sum(fstate$votes[fstate$candidate == 'Trump'])
[1] 77125904
> sum(fstate$votes[fstate$candidate == 'Harris'])
[1] 74415138
```

These sums of state-level numbers are closer to the candidates' final official vote totals (and indeed, these totals are slightly smaller than the official totals, since a small share of votes was counted after our data collection period).

`county_ENR_president.csv` may also include apparent decreases in vote counts from one timestamp to the next. This again should not be taken as evidence of any issue in the vote counting itself, but is more likely the result of internal news agency file management practices. The largest such drops were resolved when aggregating to `state_ENR_president.csv`, but some small decreases may remain.
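
Users who want to locate such decreases themselves could flag them along the following lines. This is a hedged sketch on made-up rows, not the procedure used to build `state_ENR_president.csv`. The `change` column and the `drops` object are hypothetical names introduced for illustration.

```r
# Toy rows shaped like county_ENR_president.csv (all values are made up).
county <- data.frame(
  state             = "OH",
  jurisdiction_fips = rep(c("39001", "39003"), each = 3),
  candidate         = "Trump",
  datetime          = rep(c("2024-11-05 20:00:00",
                            "2024-11-05 21:00:00",
                            "2024-11-05 22:00:00"), times = 2),
  votes             = c(500L, 480L, 700L, 200L, 350L, 350L),
  stringsAsFactors  = FALSE
)

# Put each jurisdiction-candidate series in time order.
county <- county[order(county$state, county$jurisdiction_fips,
                       county$candidate, county$datetime), ]

# Change in votes since the preceding timestamp (0 for the first observation).
county$change <- ave(
  county$votes, county$state, county$jurisdiction_fips, county$candidate,
  FUN = function(v) c(0L, diff(v))
)

# Rows where the reported count fell relative to the previous timestamp.
drops <- county[county$change < 0, ]
print(drops)
```

In this toy example a single row is flagged, where the count moves from 500 to 480. On the real data, the size of each flagged `change` relative to `votes` helps separate large file-management artifacts from small residual decreases.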

The following figure illustrates the trajectory of state vote counts in our dataset. Each curve traces the vote count across successive timestamps, with varying time intervals on the x-axis (for a more precise approach, see [1]). Note that these curves appear, visually, to be monotonically non-decreasing. While we did not rid the data of every decrease in vote counts from one time point to the next, all remaining drops are substantively small.

![](pres_votes.png "Presidential votes over time")

# References
[1] Baltz, Samuel. December 6, 2024. "How long did it take to count the vote in 2024?" <i>MIT Election Data and Science Lab Blog</i>. <a href="https://electionlab.mit.edu/articles/how-long-did-it-take-count-vote-2024">https://electionlab.mit.edu/articles/how-long-did-it-take-count-vote-2024</a>.
